llama-3.2-3B-instruct
Overall Accuracy: 18.7%
Answer Key: claude-opus-4-5-20251101
Boundary Models: 20
Pairs: 190
Total Rollouts: 950
Max Turns: 5
Question Difficulty Distribution
Too Easy: 447 (47.1%)
Calibrated: 178 (18.7%)
Too Hard: 325 (34.2%)
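
The distribution figures can be cross-checked against the header totals. The sketch below is only a sanity check, not the dashboard's own code: the variable names are hypothetical, and it assumes that "Pairs" counts unordered pairs of the 20 boundary models and that each percentage is a bucket's share of all rollouts, rounded to one decimal place.

```python
from math import comb

# Bucket counts as reported in the difficulty distribution.
counts = {"Too Easy": 447, "Calibrated": 178, "Too Hard": 325}

# The buckets should add up to the reported total rollouts.
total = sum(counts.values())

# Share of each bucket as a percentage, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Assumption: pairs are unordered pairs drawn from the 20 boundary models.
pairs = comb(20, 2)

print(total, shares, pairs)
```

Under those assumptions the counts sum to the reported 950 rollouts, the shares match the listed percentages, and 20 models give the reported 190 pairs.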
Pairwise Accuracy Matrix